ABSTRACT

The educational significance of wrong answers on
multiple choice tests was investigated in over 4,000 subjects, aged
7 to 20. Gorham's Proverbs Test, which requires the interpretation of
proverb sentences, was administered and repeated five months later.
Four questions were addressed: (1) what can the pattern of answer
choices across age, using frequencies of response as the raw data,
indicate about the psychometric properties of learner development;
(2) what process/product inferences might be drawn from these
outcomes; (3) is the total correct score adequate for evaluating
achievement; and (4) is the cumulative learning hypothesis
valid? This hypothesis implies that the principal source of meaning
is found in the frequency of right answers, or in interactions
between right answers. Analyses considered various interactions
between two right answers, one right answer and the wrong answer in
another item, wrong answers in both items, equivalent ages, and
equivalent student groups. The results suggested that learning
involves a complex hierarchical sequence of interactive non-linear
events; analysis of error patterns is more meaningful than frequency
counts; total correct scores are only adequate for evaluation of
simple recall; and the cumulative learning hypothesis is not valid
for complex cognitive processes. (GDC)
Answers on a multiple choice test were treated as categorical
events and then plotted across a broad spectrum sample using a 5
month interval on age of subject. Using the assumption of homogeneity
to remove the impact of selection proportions, cell Chi-Squares were
obtained for alternative by alternative relationships to determine
the presence or absence of interactions. About 10 percent of the
interactions emerged as "meaningful" at this level.
For R*R (right by right answer) interactions, observed frequencies were
always higher than expected and these interactions were in the same
direction within the 5 items studied, the opposite being true for
W*R interactions. Both R*R and W*W interactions occurred in [illegible]
greater proportion than W*R interactions, and W*W interactions in
a nearly significant greater proportion of occurrences than R*R
interactions. Also, R*R interactions tended to concentrate at the low
end of the age range, with W*W interactions generally [illegible] than
the others throughout the rest of the [illegible]. This observation is
opposite to that reported by Bock [illegible]. Interactions, in
particular, appeared to be [illegible] within [illegible] four apparent
age divisions of the range or adjacent levels. Interactions between
[illegible] clearly changed with age, implying that a single key
[illegible] system for all answers across all levels may be
inappropriate for interpreting these patterns. Pairs at the same age
were significantly more frequent in number than sequences with the
same group, suggesting age to be the more important developmental
factor. The assumption that the distribution of these events might be
random was not supported ([statistic illegible]) by these data.
Cognitive development, as revealed by this procedure, appears to be
complexly interactive and non-linear in nature. These findings appear
to be generally consistent with other "wrong answer" research. A
plan for continuing the exploration of these relationships was
proposed and an invitation to participate was extended.
Can Developmental Status
Information be Obtained from 
Wrong Answers? 



INTRODUCTION 

This present paper represents the culmination of more than 10
years of exploration into the educational significance of wrong
answers on multiple choice tests (See: Powell 1968, 1970, 1976,
1977, 1978a, 1978b, Powell and Isbister, 1974). A number of
tantalizing observations seem to have emerged in repeated studies
with different tests and different populations. Most notable have
been:

1. Consistency of reasoning behind specific wrong answer
selections has been repeatedly found (Powell 1968,
1977).

2. Distinctive patterns among wrong answers independent of
right answers have occurred consistently (Powell 1968,
1970, 1976, 1977, Powell and Isbister, 1974).

3. Wrong answers seem to produce "better" predictors of
independent achievement measures than right answers
with different tests and populations (Powell 1970,
1976).

4. Curved line patterns among wrong answers across age
have been found using quite different analytic
procedures (Powell 1976, 1978a).

5. A strong developmental trend among wrong answers
(Rho = 1.00) has been found (Powell 1977).

This combination of results raises some interesting questions
about the current practice of using Total Correct scores for
achievement assessment as the sole criterion of success.

The use of Total Correct scores for achievement assessment has
a long and honourable history. It is built upon a perfectly reasonable
model and has had considerable support from the results of statistical
procedures derived from the model. The fact that educational research
has generally tended to be non-conclusive (See: Walker and Schaffarzick,
1974) has only spurred renewed vigor in pursuit of the confounding
variables which produced these results.

Figure 1 below gives the behavioral assumptions and the
mathematical equivalents of these which together form the basis for
classical test theory.

FIGURE 1

PRINCIPAL ASSUMPTIONS IN
CLASSICAL TEST THEORY

1. THE KNOW-GUESS HYPOTHESIS

   Behavioral Assumption: The learner either knows the answer or
   guesses (blindly).

   Mathematical Equivalent: $x_{ij} \in \{0, 1\}$; i.e. the answer
   given by person i on item j is a 1 for a right answer and a 0
   for a wrong answer.

2. THE CONTINUOUS LEARNING HYPOTHESIS

   Behavioral Assumption: Learning is continuous and cumulative.
   (Graphically, learning would appear as a straight line when age
   and achievement are equated.)

   Mathematical Equivalent: $X_i = \sum_{j=1}^{n} x_{ij}$; i.e. the
   score of a person on a test is the sum of all the 1's and 0's
   from the items on the test.

3. THE LINEAR MEASURABILITY HYPOTHESIS

   Behavioral Assumption: Achievement is linearly measurable.
   (Departures from the straight line in # 2 are the product of
   meaningless measurement error and other random events.)

   Mathematical Equivalent: $X_i = T_i + E_i$; i.e. the observed
   score ($X_i$) is a linear combination (sum) of the person's
   linear true score ($T_i$) and an error of measurement ($E_i$).



The three basic assumptions presented in Figure 1 are the
fundamental foundations behind classical test theory.

The first two of these are more important than the third,
because alternative measurability models which are non-linear in
nature could be formulated if the need to do so were recognized and
established.

In order to challenge classical test theory, then, it would be
necessary to refute the first two hypotheses. These two hypotheses
cannot be challenged upon mathematical or logical grounds. In order
to successfully refute them it must be conclusively demonstrated in
behavioral terms that learners do not always behave as these
assumptions indicate.

The evidence reported above is strongly suggestive that both of
these assumptions may be false. However, heretofore the evidence has
been conclusive only for assumption # 1, the "Know-Guess" Hypothesis.

Treating a wrong answer as a zero (0), or using formula
scoring, is mathematically equivalent to assuming that wrong answers
are "blind" guesses, that is, that these answer selections could be
just as easily achieved by flipping coins, rolling dice, etc., as
by reading the question and the proposed answer set.
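
The two scoring rules just mentioned can be sketched in a few lines.
This is a minimal illustration only: the answer key and responses are
hypothetical, and the formula score shown is the conventional
correction for guessing (rights minus wrongs divided by the number of
alternatives less one), which the text names but does not spell out.

```python
from typing import Optional, Sequence


def total_correct(responses: Sequence[Optional[str]], key: Sequence[str]) -> int:
    """Score each item 1 (right) or 0 (wrong or omitted) and sum the scores."""
    return sum(1 for given, right in zip(responses, key) if given == right)


def formula_score(
    responses: Sequence[Optional[str]],
    key: Sequence[str],
    n_alternatives: int = 4,
) -> float:
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items (None) count as neither right nor wrong.
    """
    right = total_correct(responses, key)
    wrong = sum(
        1
        for given, correct in zip(responses, key)
        if given is not None and given != correct
    )
    return right - wrong / (n_alternatives - 1)


# Hypothetical 5 item key and one respondent's answers (None = omitted).
key = ["b", "a", "d", "c", "b"]
responses = ["b", "c", "d", "a", None]

print(total_correct(responses, key))
print(formula_score(responses, key))
```

Both rules discard which wrong alternative was chosen; only the
number of wrong answers enters the score.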

Stated in this way, the know-guess hypothesis is easily
testable. Strictly random wrong answers will be distributed about
equally among all wrong alternatives. It has long been known that
such a distribution rarely, if ever, occurs. This observation
is sufficient to refute the "Know-Guess" hypothesis. More complex
approaches which address the several sub-problems connected with this
hypothesis can also be formulated and tested. It is probably safe
to say that the "Know-Guess" hypothesis as formulated above has now
been conclusively refuted (See: Powell and Isbister 1974, and Powell
1978). Refutation of this hypothesis, however, may be of no consequence
if the only meaningful information with respect to achievement in a
question is to be found in the right answers.
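
The test described above can be sketched as a one-way Chi Square
against an equal split of the wrong answers. The counts below are
hypothetical and are not taken from the study; the critical value is
the standard tabled value for 2 degrees of freedom at p = .05.

```python
from typing import Sequence


def equal_split_chi_square(wrong_counts: Sequence[int]) -> float:
    """Chi Square for wrong-answer counts against an equal split.

    Under blind guessing, each wrong alternative is expected to
    attract the same share of the wrong answers.
    """
    total = sum(wrong_counts)
    expected = total / len(wrong_counts)
    return sum((observed - expected) ** 2 / expected for observed in wrong_counts)


# Tabled critical value of Chi Square for 2 degrees of freedom, p = .05.
CRITICAL_DF2_05 = 5.991

# Hypothetical counts for the three wrong alternatives of one item.
wrong_counts = [60, 25, 5]
statistic = equal_split_chi_square(wrong_counts)

print(round(statistic, 2))
print(statistic > CRITICAL_DF2_05)  # True means the equal split is rejected
```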

Another approach takes an alternative behavioral view. In
this approach, developed by Shuford, Albert and Massengill (1966),
the assumption is that the respondent is expected to know something
about all options. In so doing, weights can be assigned to
all alternatives in order of their likelihood of being right. In
this way, partial information can be accommodated. Partial information,
in this approach, is assumed to increase the likelihood of getting
the answer to the question right. This assumption is also of
dubious validity. No theory, apparently, considers the role of
"misinformation" in answering.

Hakstian and Kansup (1975) tested a wide range of approaches
to this problem and concluded that no benefits accrued from any of
several approaches they tested. They defined "benefit" in terms of
increased reliability of test scores by adding any of several possible
weighted combinations of wrong answers to the Total Correct score.

A third approach has been to determine from self-report the
respondent's reasoning behind choosing wrong answers. Powell (1968)
showed that this reasoning was consistent within wrong answer factors
and cross-validated about two-thirds of the time. In a
later study (Powell 1977) with the same test and different age
groups the validation levels were between .50 and .60. These
findings imply that the selection of wrong answers may be systematic
rather than random. Such a conclusion is consistent with the implied
refutation of the know-guess hypothesis, but inconsistent with the
findings reported by Hakstian and Kansup.

However, Hakstian and Kansup tried to find some method of adding
a weighted combination of wrong answers to the Total Correct score.
They examined the impact of their attempts upon the reliability of
the test scores. In this respect, they used assumptions 2 and 3 to
help support assumption 1. This approach to scientific procedure
is incorrect. Support or refutation of the fundamental assumptions
in a theory must be achieved upon a "stand alone" basis. The reason
for this problem is that if learning is NOT a linear phenomenon,
then the adding of a non-random distribution of wrong answers would
produce either null or inconsistent results. The failure of Hakstian
and Kansup to find an additive approach which improved reliability
could be a product of non-linearity.

To give a bit of background, it would perhaps be helpful to
illustrate how learners apparently select answers on multiple choice
achievement tests.
CHILDREN'S REASONING AND WRONG ANSWERS

The easiest method of clarifying the problem of the nature of
answer selection is to begin with an example. Figure 2 provides this
illustration.

INSERT FIGURE 2 ABOUT HERE

In Figure 2 alternative C was most commonly selected by the 8
year olds in the study. There seems to be no logical relationship
between "Quickly Come, Quickly Go" and "Always do things on Time".
However, when the commonly reported reason "That's what teacher always
says" is taken into account the logic is immediately evident to anyone
familiar with a typical Grade 3 classroom. These children have interpreted
"Quickly Come, Quickly Go" as a description of their personal
classroom behavior. With several instructional groups and the complex
interlocking and overlapping timetable of activities which this
classroom management system necessitates, "Quickly Come, Quickly Go" and
"Always do things on time" describe very succinctly what life in such
a setting must be like for the young learner. In dealing with this
proverb within the framework of their experience, these children
have given a perfectly valid answer which is "wrong" only because
it is not included in the scoring key. Better than 50 percent of
the children selecting 18c reported semantically equivalent reasons
to this one.

FIGURE 2

EXAMPLE INTERPRETATION

PROVERB:

     QUICKLY COME, QUICKLY GO.
     (Easy Come, Easy Go)*

TRANSLATIONS:

a. ALWAYS COMING AND GOING AND NEVER SATISFIED.
   Characteristic of 13 year olds.
   "You Should Stick To A Job Til It's Finished."

b. WHAT YOU GET EASILY DOES NOT MEAN MUCH TO YOU.
   Characteristic of adults.

c. ALWAYS DO THINGS ON TIME.
   Characteristic of 8 year olds.
   "That's What Teacher Always Says."

d. MOST PEOPLE DO AS THEY PLEASE AND GO AS THEY PLEASE.
   Characteristic of 10 year olds.
   "It Talks About Coming and Going."

*Item No. 18 from Gorham, Donald R. Proverbs Test, Psychological Test
Specialists, 1956. Reproduced with permission.

Of course, Piaget's work has long demonstrated that children, and
particularly young children, reason differently from adults. If this
were merely an isolated event, then it could easily be resolved by not
using this instrument in settings where unexpected or at least alternative
interpretations of these items were likely. Another approach, of
course, would be to determine how to identify when children were
responding in this way to this and other similar items, and to use
this information to identify the problem solving approaches used by these
children. This suggestion proposes an intriguing alternative approach
to testing. Do the "wrong" answers reveal how the respondents are
attacking the problems?

If the 10 year olds were looking for a word-for-word (literal)
translation of the proverb, then their typical choice (18d) as reported
in Figure 2 makes sense as well, and similarly if the 13 year-olds
are searching for simple (linear) cause-effect relationships, so does
their typical choice (18a). Their choices may be the results of the
way they attack and interpret the problem. In problem solving terms,
these differences in selection seem to be related to a combination of
frame of reference and solution strategy. Each age group framed
the problem differently, and the 13 year-olds added extraneous
information.

Using the statistical clustering of items with common modes to
produce "homogeneous" answer subsets and the reported reasons to
classify these subgroups, the present author identified 14 subgroups of
answers, 12 of them among the wrong answers. These, in aggregate,
accounted for 151 of the 160 alternatives in this 40 item test. The
internal consistency of the test increased from r = .76 to r = [illegible]
with this approach. When these 14 subsets of the answers were ordered
using the Simplex procedure, the resulting scaling perfectly
recapitulated the age sequence of the subtests by their modes of selection
(Rho = 1.00). In addition, the logic of the answering followed Piaget's
stages except that Concrete Operations seemed to have two substages.

Thus it would appear from this behavioral description which
emerged from the reasoning behind answer selection, that at least for
items of the type present in the Proverbs Test, the "Know-Guess"
hypothesis may not be an appropriate behavioral description. Instead,
answer selection seemed (as also occurred in the earlier study;
Powell 1968) to be related to the world views of the respondents;
in Piaget's terminology, to the schemata they hold.

The second approach (also used elsewhere) to refute the
"Know-Guess" hypothesis was statistical and was also attempted upon a
"stand alone" basis. The results are probably conclusive that the
"Know-Guess" hypothesis is false (See: in particular, Powell and
Isbister, 1974). Given an opportunity to make a reasonable decision,
apparently most people will not respond "blindly" to multiple
choice achievement test items.

The details of the study concerning children's reasoning just
summarized are reported elsewhere (Powell 1977).

A SEQUENCE OF DEVELOPMENT 
FROM WRONG ANSWERS 

In a general way the study just reported above has already
demonstrated a strong developmental trend among wrong answers.
But this observation raises more questions than it answers. If
such a powerful developmental trend is present among wrong answers, why
has it not shown up before?

In most approaches to testing, the right answers are considered
to be the main if not the only source of information about the learner
and to form a straight line scale. This approach involves assuming
learning to be cumulative (i.e., assumption # 2). These assumptions
justify the counting of right answers and/or the addition of subscores
to form subtest scores and the Total Correct score. The Total
Correct score or some linear transformation of it is often the sole
basis for evaluating learner progress. A good example of this
approach is the Peabody Picture Vocabulary Test, which is a broadly
used standardized test of receptive vocabulary. The norms on this
test, however, are not presented in terms of vocabulary development
patterns, but rather as deviation Intelligence Quotients.

If the relationships among answers are not straight lines, then
much more complex approaches to tests than classical test theory
may be in order. At this point there are two possible approaches.
First is the "arm chair" approach in which a series of assumptions
are made and a mathematical model built and tested, often with
simulated data. This approach requires much more mathematical
skill than the present researcher possesses. A second approach is
to collect a large sample of data and to explore these data for their
structural and relational properties, i.e. to develop a "grounded
theory" (after Glaser and Strauss, 1967).

Making the smallest possible number of assumptions forces the use
of all data as frequencies only. In this case, the behavioral
assumption made must be that the respondent assigns categorical
values to each alternative in each item. In this case alternatives
should be statistically related, either as artifacts of random
variations, or on the basis of similarities among assignment
procedures.

The most reasonable method to use is the Chi Square ($\chi^2$) procedure,
which, though robust, is less sensitive than procedures which make
more assumptions. In addition, on an alternative-by-alternative
basis, many of the events examined will need to be cell Chi Squares.
In the absence of critical values for zero degrees of freedom, cell
Chi Square values present an important interpretation problem to the
use of the Chi Square technique in this study. Furthermore, the
four alternatives in any one item can be compared in several ways.
Each approach produces a different contingency table from these same
data. They can be compared in pairs of items either across the entire
sample or by using subgroupings from within the sample.

Another approach is to compare an item with itself between
reasonable subgroupings of the sample such as at different age levels,
or between sexes or different administration times.

In addition to the possibility of formulating the contingency
tables in several ways, different assumptions can be tested with each
table. There are three common approaches: random, homogeneous,
and/or external model assumptions can be used.
In the random event approach, each cell is expected to occur with
the same probability or with a distribution reflecting the properties
of the "normal curve". In this case the expected frequencies used
in each cell within one table are identical to each other or related
to the area under the curve. Using equal cell frequencies is pointless
here, because the "Know-Guess" hypothesis has already been refuted.

Under the homogeneous assumption, only the marginal proportions of
these events are assumed to be meaningful. This is the approach used
in this study.

In the third case, a model which is external to the contingency
table is tested with the table for goodness of fit. An approach
to external model building called MULTIQUAL, developed by Bock
(1973), can be used to compare patterns among cell frequencies with
some form of external mathematical model. Figure 3 illustrates the
outcomes of this last approach.



INSERT FIGURE 3 ABOUT HERE 



In summary, Figure 3 shows the testing of all possible straight
line formulations of an age sequence of frequencies within one item.
Among these straight line models the only one which comes close to a
fit includes the regression line for the right answers and the average
proportions of selection among the wrong answers. Since the average
proportions are apparently a necessary component of the model, these
events are not random, further negating the use of the random
assumption in this present study. However, even this fails to fit
to the p > .05 criterion. The MULTIQUAL formulation (MODEL 7) includes
both quadratic and cubic functions along with straight lines before
a fit is achieved.

FIGURE 3

DISTRIBUTION OF RESPONSES TO ITEM ONE
FITTED TO SEVERAL POSSIBLE MODEL THEORIES

[Figure not reproduced. Each panel plots proportion of selection
against AGE (9 to 12), with separate lines for RIGHT ANSWERS and
WRONG ANSWERS. The label FAILS appears on several of the straight
line model panels. Panel titles, where legible:

a. OBSERVED DISTRIBUTION
b. MODEL 1: ALL ANSWERS RANDOM
c. MODEL 2: AVERAGE OF RIGHT ANSWERS MEANINGFUL
d. MODEL 3: REGRESSION ON RIGHT ANSWERS IS MEANINGFUL
e. MODEL 4: REGRESSION WITH ONE WRONG ANS- (title truncated)
f. MODEL 5: REGRESSION (title truncated)
g. MODEL 6: REGRESSION (title truncated)
h. MODEL 7: LINEAR, QUADRATIC AND CUBIC FUNCTIONS COMBINED]

At least for this one item of the Proverbs Test (Gorham 1956) and
for these 4 ages (9 through 12 inclusive), the patterns of answer
selection among all alternatives are apparently not straight lines.
This same test with a different (larger) sample was used in this
present study. Making some assumptions discussed later, the original
data validated on replication but the MULTIQUAL model did not. The
findings just summarized here are reported in much more detail
elsewhere (Powell, 1978a).

The finding of curved line relationships among answers in this
item supports findings reported elsewhere (Powell 1976, Yu 1977),
supports the refutation of the "Know-Guess" hypothesis, and implies
possible problems with the "Cumulative Learning" hypothesis. However,
this later study (Powell, 1978a) does not deal with the "Cumulative
Learning" hypothesis on a "stand alone" basis.

In order to refute the "Cumulative Learning" hypothesis, it
would be necessary to approach it in a situation uncontaminated by
other influences. The procedure finally decided upon involves
removing the influence of the aggregate frequencies upon events and
examining interactions between alternatives among items for systematic
events.

The purpose of this present study, then, is to attempt to refute
the "Cumulative Learning" hypothesis on a "stand alone" basis.
Refutation of this hypothesis would be necessary and sufficient to
refute the use of classical test theory as an approach to the evaluation
of learning progress.

THE SAMPLE USED 

Since it had become evident that meaningful information about the
development of cognition might be found among wrong answers, a major
study has been mounted involving more than 4,000 children in the age
range from about 7 years to over 20 (grades 3 to 13 inclusive). The
test (Gorham's Proverbs Test) was administered in conjunction with a
personality test, and both were repeated after a 5 month interval.
Because of the reading level the personality test was not used with
Grades 3 and 4. The distribution of ages (condensed into 5 month
intervals) for the Proverbs Test is given in Table 1.

INSERT TABLE 1 ABOUT HERE

The purpose of aggregating to 5 month intervals is two-fold.
First, it makes a satisfactory minimum sample size in all but 4 of
the 60 groups. Second, since the second administration was 5 months
after the first, comparison between two groups of the same age
range assures independence of group membership. Also, comparison
between October (N) and March (N + 1) age levels gives a sequential
comparison of the two administrations with largely the same subjects
in each group.
TABLE 1

DISTRIBUTION OF SUBJECTS
IN THIS STUDY BY AGE LEVEL AND THE TIME
OF ADMINISTRATION OF THE TEST.

AGE     AGE IN      AGE IN   OCTOBER          MARCH            TOTALS
LEVEL   MONTHS      YEARS    ADMINISTRATION   ADMINISTRATION

  1     below 96                  43                 3             46
  2      96-100        8          68                32            100
  3     101-105                   70                55            125
  4     106-110        9         130                53            183
  5     111-115                  101               120            221
  6     116-120       10         127                78            205
  7     121-125                  137               101            238
  8     126-130                  142               114            256
  9     131-135       11         145               118            263
 10     136-140                  129               125            262
 11     141-145       12         165               106            271
 12     146-150                  135               131            266
 13     151-155                  138               100            238
 14     156-160       13         152               104            256
 15     161-165                  114               132            246
 16     166-170       14         163               101            264
 17     171-175                  262               150            412
 18     176-180       15         264               237            503
 19     181-185                  258               247            505
 20     186-190                  251               255            506
 21     191-195       16         249               228            477
 22     196-200                  219               220            439
 23     201-205       17         210               219            429
 24     206-210                  171               173            344
 25     211-215                  186               130            316
 26     216-220       18         125               131            256
 27     221-225                   87                81            168
 28     226-230       19          47                66            113
 29     231-240       20          20                43             63
 30     above 240                 10                14             24

TOTALS                          4319              3676           7995

MOSTLY SAME GROUP: the October group at one age level and the March
group at the next age level (for example, October level 2 and March
level 3).
There is not 100 percent equivalence between the groups in these
two sets since they represent administration to everyone in 10 schools
in October and only 9 schools in March. One school voluntarily
withdrew. It also includes some who were absent in October, yet were
present in March, and so on. Person-by-person comparisons between
administrations can be made from these data but were not made in
the particular study reported here.
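
The aggregation into 5 month age levels used in Table 1 can be
sketched as a simple lookup. This is an illustrative sketch only: it
assumes ages recorded in whole months and takes its boundaries from
the table as printed (level 1 below 96 months, levels 2 through 28 in
5 month steps from 96, level 29 covering 231 to 240, and level 30
above 240). The function name is hypothetical.

```python
def age_level(age_in_months: int) -> int:
    """Map an age in whole months onto the 30 age levels of Table 1."""
    if age_in_months < 96:
        return 1
    if age_in_months > 240:
        return 30
    if age_in_months >= 231:
        return 29  # the one 10 month interval in the table
    # Levels 2..28: 96-100, 101-105, ..., 226-230.
    return 2 + (age_in_months - 96) // 5


for months in (95, 96, 100, 101, 230, 231, 240, 241):
    print(months, age_level(months))
```

A subject tested in October and again 5 months later in March moves
up exactly one level under this scheme (except within the 10 month
level 29), which is the basis for the sequential comparison described
above.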



HYPOTHESIS TESTED

The procedure of examining contingency tables for departures
from homogeneity effectively removes the impact of aggregate selection
proportions.

The "Cumulative Learning" hypothesis is used in a manner which
implies that the principal source of meaning is to be found in the
aggregate frequency of "right" answers. If interactions are ever
considered, these are derived from the right answers only. These
sub-test aggregates and whole test aggregates often become the sole
basis for achievement evaluation.

Under the null hypothesis condition, then, departures from
homogeneity should be at best purely random and at worst should be
a phenomenon confined mainly to right by right answer interactions.

If these null hypotheses are supported, then the right answers
and/or subgroupings thereof would be necessary and sufficient to
describe successful learning. If they are refuted then right answers
and/or subgroupings thereof would not be sufficient to describe
learning. In this latter case, since the events being examined are
interactive, this refutation is by itself sufficient to refute the
"Cumulative Learning" hypothesis.

The reason why the finding of meaningful wrong-by-wrong answer
interactions refutes the "Cumulative Learning" hypothesis is that
such interactions are, by definition, non-linear. If right answers
are not sufficient to describe achievement, because of meaningful
wrong-by-wrong answer interactions being present among these data,
then the learning itself must be non-linear and interactive.

If learning is non-linear, then the "Cumulative Learning"
hypothesis becomes insufficient to describe learning events and falls
as a model.

If the "Cumulative Learning" hypothesis is refuted, then classical
test theory becomes an inappropriate model for achievement since the
first term in the fundamental equation ($X_i = T_i + E_i$), namely
$T_i$, has been demonstrated to be insufficient to describe learning
by means of the refutation of the two hypotheses ("Know-Guess" and
"Continuous Learning") upon which it is built. On the basis of this
reasoning, present approaches to testing would be expected to either
stand, become qualified, or fall on the basis of this present study.

PROCEDURE

Using the breakdown given in Table 1 and the first 5 items on
the Proverbs Test, an inter-item comparison for each 5 month aggregate
for each of the two administrations was obtained. With 10 inter-item
comparisons, 30 age levels, and 2 administration times, 600
four-by-four contingency tables were produced, each alternative of one
item being cross-tabulated with each alternative on each other item among
all 5 items. This approach examines about 1/80th of the interaction
data available from this 40 item test.
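
The counts in this paragraph follow from simple combinatorics, as the
short check below illustrates. The test length of 40 items is an
assumption inferred from the 160 alternatives at 4 per item reported
earlier; the variable names are illustrative only.

```python
from math import comb

ITEMS_USED = 5          # first 5 items of the Proverbs Test
ALTERNATIVES = 4        # alternatives per item
AGE_LEVELS = 30         # 5 month aggregates in Table 1
ADMINISTRATIONS = 2     # October and March
ITEMS_IN_TEST = 40      # assumption: 160 alternatives / 4 per item

item_pairs = comb(ITEMS_USED, 2)                     # inter-item comparisons
tables = item_pairs * AGE_LEVELS * ADMINISTRATIONS   # four-by-four tables
cells = tables * ALTERNATIVES ** 2                   # cell Chi Squares
all_pairs = comb(ITEMS_IN_TEST, 2)                   # pairs in the whole test

print(item_pairs, tables, cells)
print(f"fraction of item pairs examined: 1/{all_pairs // item_pairs}")
```

Ten of the 780 possible item pairs is 1/78, which matches the
"about 1/80th" stated in the text.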

These 600 tables were generated including the cell Chi Squares
under the assumption of homogeneity. In this assumption, the
expected values are generated on the basis that all meaning resides
in the marginal totals. For instance, if there are 180 students in
the group, 100 get one item correct and 120 get the other correct,
then the expected joint occurrence would be 100 x 120 / 180 = 66.67.
If 80 students actually chose the right answer for both items, the
cell Chi Square would be (80 - 66.67)^2 / 66.67 = 2.67.
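
The worked example can be reproduced with a short sketch that computes
the cell Chi Square for every cell of a contingency table from its
marginal totals. The study used four-by-four tables; the two-by-two
table here (right or wrong on each item) is a collapsed, hypothetical
table chosen only so that its margins match the figures in the text.

```python
from typing import List, Sequence


def cell_chi_squares(table: Sequence[Sequence[float]]) -> List[List[float]]:
    """Cell Chi Squares under homogeneity: (O - E)^2 / E with
    E = row total x column total / N.

    Assumes every row and column total is greater than zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            out_row.append((observed - expected) ** 2 / expected)
        result.append(out_row)
    return result


# Rows: first item right / wrong. Columns: second item right / wrong.
# Margins: 100 right on the first item, 120 right on the second, N = 180.
table = [
    [80, 20],
    [40, 40],
]
cells = cell_chi_squares(table)
print(round(cells[0][0], 2))  # right-by-right cell from the worked example
```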

Using the frequency distribution of all 9,600 cell Chi Squares
(divided into 10 groups and averaged to get maximum stability) a
distribution of the cell Chi Squares was attained. By comparison of
this distribution with extrapolations from a Chi Square table, a
critical value of $\chi^2 \geq 2.4$ was obtained. Any cell Chi Square
in this range was considered to be "meaningful". About 10 percent of
the cell Chi Squares fell in this category.

These values cannot be called "significant" since the mathematical
distribution for Chi Square with zero degrees of freedom is not
available, hence the use of the alternative term "meaningful".

In the above example, the 80 students should, collectively, be
considered to have selected both correct answers jointly at a
"meaningful" level (O>E). That is, the observed frequency is
"meaningfully" (rather than significantly) higher than expectation.
If only 40 had chosen these two items jointly then the observed
frequency would be "meaningfully" below expectation (O<E).
In this case it is more likely for these students to get only one of
these two items right than to get both right. The third case, where the
cell Chi Square is < 2.4, implies that the joint occurrence (both items
right) is determined by the proportional tendencies to get either
right independent of the other rather than upon some systematic
property of the behavior of the respondents with these two items in
interaction (O=E). In this latter case, such meaning as occurs is
to be found among the frequencies rather than the interactions.
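
The three cases just described amount to a simple decision rule. The sketch below assumes the critical value of 2.4 and treats a cell Chi Square at or above that value as "meaningful"; the function name and the string labels are illustrative choices, not the author's.

```python
CRITICAL_VALUE = 2.4  # critical cell Chi Square assumed from the text


def classify_cell(observed, expected, critical=CRITICAL_VALUE):
    """Label a cell O>E, O<E or O=E using the cell Chi Square criterion."""
    chi = (observed - expected) ** 2 / expected
    if chi < critical:
        return "O=E"  # joint occurrence explained by the marginal frequencies
    return "O>E" if observed > expected else "O<E"


expected = 100 * 120 / 180  # expected joint frequency from the worked example
print(classify_cell(80, expected))  # O>E
print(classify_cell(40, expected))  # O<E
print(classify_cell(70, expected))  # O=E
```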

INTERACTIVE RELATIONSHIPS EXAMINED

There are nine possible relationships among all the alternatives
on multiple choice achievement items.

1. Between the two right answers (R*R)

2. Between one of the right answers and the wrong
answers in the other item (W*R)

3. Between the wrong answers in both items (W*W)

These three can be tabulated across the O>E, O=E and O<E
relationships. It should be noted that these categories are different
from those assumed on a personality test where no answer is considered
to be correct and between item dependencies and/or within item scales
may be deliberately included when the test is constructed.

For a test with 4 alternatives per item there will be 1 cell of the
(R*R) class, 6 of the (W*R) class, and 9 of the (W*W) class in each
inter-item table. In theory then, if interactions are purely random,
the equation R*R = W*R = W*W should hold true in all but about 30 of
the 600 tables to be examined. Also, if all events are assumed to be
random, about 2.5 percent will be of the O>E type, the same for O<E,
and 95 percent will be of the O=E type. If actually meaningful results
are being derived from this study then systematic departures from
these patterns should be evident.
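
The cell counts behind these classes can be checked by enumeration. The sketch below is an illustration only: it arbitrarily labels alternative 0 of each item as the right answer and uses the design figures given earlier (10 inter-item comparisons, 30 age levels, 2 administrations).

```python
from itertools import product


def count_cell_types(n_alternatives=4):
    """Count R*R, W*R and W*W cells in one inter-item contingency table."""
    counts = {"R*R": 0, "W*R": 0, "W*W": 0}
    for a, b in product(range(n_alternatives), repeat=2):
        right_a, right_b = a == 0, b == 0  # index 0 stands for the right answer
        if right_a and right_b:
            counts["R*R"] += 1
        elif right_a or right_b:
            counts["W*R"] += 1
        else:
            counts["W*W"] += 1
    return counts


per_table = count_cell_types()
n_tables = 10 * 30 * 2  # comparisons x age levels x administrations
totals = {kind: n * n_tables for kind, n in per_table.items()}
print(per_table)  # {'R*R': 1, 'W*R': 6, 'W*W': 9}
print(totals)     # {'R*R': 600, 'W*R': 3600, 'W*W': 5400}
```

The three totals sum to the 9,600 cell Chi Squares mentioned in the text.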

With this information the null hypothesis can now be expressed
in mathematical terms.

NULL HYPOTHESIS:

$$H_0:\ \frac{f(O>E)_{R*R}}{N_1} = \frac{f(O<E)_{R*R}}{N_1} = \frac{f(O>E)_{W*R}}{N_2} = \frac{f(O<E)_{W*R}}{N_2} = \frac{f(O>E)_{W*W}}{N_3} = \frac{f(O<E)_{W*W}}{N_3} = .025$$

Where the N values are the maximum possible number of these
types of interaction whose frequencies are described in the numerator.
Support for H0 implies that the only source of meaning in the test
under study would be the frequencies of answers. In this case, current
practice would probably be supported.

ALTERNATIVE HYPOTHESIS 1: 

$$H_{1a}:\ \frac{f(O>E)_{R*R}}{N_1} > .025 \quad \text{[remainder illegible in source]}$$

$$H_{1b}:\ \frac{f(O>E)_{W*W}}{N_3} = \frac{f(O<E)_{W*W}}{N_3} = .025$$

In this case, right answer aggregates and appropriate right
answer subtests built from the R*R interactions would form the
necessary and sufficient summary of achievement status information.
All possible alternate hypotheses can be built in this manner.
Most of the remainder, however, imply that right answers are not
sufficient to account for achievement status. The only exception to
this statement would be the case where R*R and W*W are random but
W*R shows the right and wrong answers polarizing. This latter would
express the "Linear Dependency" proposition which is a logical
deduction from the Know-Guess hypothesis and which has already been
refuted elsewhere (See: Powell and Isbister, 1974). It should be
noted that these refutations have been upon behavioral rather than
mathematical terms. The statistical relationships found among live
data do not support the observations which should have occurred if
these models were appropriate. The mathematics of existing test
theories in current general use are logico-deductive systems, which as
such are refutable only upon the basis of errors in the deductive
processes. The refutation of concern here is to the applicability
of these models to the particular kinds of data being studied.

COMPARISONS MADE

In addition to the testing of these hypotheses, the
following comparisons were made.

1. Frequencies and proportions of "meaningful" events
separated by item comparisons and aggregated by age
level in each of these 3 categories (W*W, W*R, and R*R).

2. Frequencies and proportions of "meaningful" events
separated by age level and aggregated across item
interaction in each of the 3 categories.

3. Frequencies of opposite events by item and category.

4. Frequencies of age equivalent pairs and student
equivalent group sequences between the two
administrations.

5. Continuities between pairs of alternatives across
age levels.

In this latter case no aggregation was used.



Looking at the comparative distribution of proportions of W*W,
W*R, and R*R across age may give some idea of the degree to which
each contributes meaningfully at different age levels. Bock (1972)
found that wrong answers added meaningfully to the results of those
students below the median but not above it. However, he used
vocabulary items which are essentially "know-guess" in format or at
the Knowledge Level in Bloom's Taxonomy (1956). The Proverbs Test
contains "translation" items which are at least at the Comprehension
level of Bloom's Taxonomy. If the pattern Bock found continues
for items at a higher level of cognitive processes, then for evaluative
purposes right answers would be necessary and sufficient. Wrong
answers might still have potential diagnostic value, particularly for
the lower scoring students. If this pattern is reversed with W*W >
R*R at the higher ages, then wrong answers may need to become an
important part of evaluation.

In addition, of course, cross-referencing events by both item
and age level makes it possible to establish the accuracy of both
tabulations by the match of the totals.

The frequencies of opposites by item and category should give
some indication of the amount of "noise" in these observations both
within item pairs and by category. In this case, a "noisy" solution
for wrong answer interactions would be

$$\frac{f(O>E)_{W*W}}{N_3} = \frac{f(O<E)_{W*W}}{N_3} = .025$$

A look at the age equivalent pairs and student equivalent group
sequences should give some idea of the degree of consistency among these
observations and whether age equivalent or student equivalent events
are more meaningful developmentally. If age equivalence is the more
stable then developmental sequence is supported over intragroup
stability; the reverse implies the opposite conclusion.

CONTINUITY OF INTERACTIONS ACROSS AGE 

Continuous sequences were operationally defined using the following
algorithm.

1. The total occurrence of meaningful events on any
alternative pair across all age levels must be at
least 6 to form a continuity. This means that 10
percent of the possible 60 must be represented.

2. Of these, at least half of them, or 3 events, must be
close enough together that not more than one age
level separates any two events.

3. If more than one age level separates any two
elements in a sequence, the sequence is assumed to
be discontinuous at that point.

4. Fewer than three elements at different ages are not
considered as a continuous sequence.

5. A pair is considered a single age event for the
purposes of establishing a continuous sequence,
but is considered with respect to the density of
that sequence.

6. Overlap of no more than 3 events into one of the
four "cognitive processing level" subdivisions
developed in this study is assumed to place the
sequence into the level of greatest density.

Here is an example to illustrate these rules.

Figure 4

"Example of a Continuous Sequence"

[Chart not recoverable: one interaction plotted against AGE LEVEL.
The legend distinguishes a blank event within an acceptable sequence,
a meaningful comparison for one of the two administration times,
a meaningful comparison for both of the two administration times
at this age level (PAIRS), and a sequential event.]

In this case there are 10 meaningful events at eight different
age levels including 2 pairs. Had there been less than 6 instead of
10, none of these would have been considered further. The Age Level
6, BLANK, 8 sequence is close enough but not long enough to be considered
to form a continuous sequence. The pair by itself at Age Level 15 is
not close enough to two others to be continuous. The event at Age
Level 21 is separated by two age levels from its nearest neighbor
and is, therefore, not part of the continuous sequence. The pattern
BLANK, 26, 27, 28 meets all criteria and is, therefore, considered
to form a continuous sequence. The balance are assumed to be statistical
artifacts and ignored. Thus the density of this "continuous sequence"
is 4 out of 9 or .50. The arrow above 27 toward 28 means that this
interaction was found in the first administration for age level 27
and in the second administration for age level 28. In this case,
this event would seem to be more a characteristic of this group
than of the age levels.
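
The first five rules and the example above can be sketched as a short routine. This is a minimal sketch under stated assumptions, not the procedure actually used in the study: the function name is invented, rule 6 (assignment to a cognitive processing level) is left out, density is taken as the share of all meaningful events for the interaction that fall inside the sequence, and the event layout is a hypothetical one chosen to agree with the prose description, since the figure itself cannot be recovered. In particular, the event at age level 24 and the position of the second pair are assumptions.

```python
def continuous_sequences(events, min_total=6, min_levels=3, max_gap=1):
    """Find continuous sequences among meaningful events.

    events maps an age level to the number of meaningful events found there
    (1 for a single administration, 2 for a pair at the same age level).
    Returns a list of (age_levels, density) tuples.
    """
    total = sum(events.values())
    if total < min_total:  # rule 1: too few events to form a continuity
        return []
    levels = sorted(events)
    runs, current = [], [levels[0]]
    for level in levels[1:]:
        skipped = level - current[-1] - 1  # age levels lying between neighbours
        if skipped <= max_gap:  # rules 2 and 3: at most one level may separate events
            current.append(level)
        else:
            runs.append(current)
            current = [level]
    runs.append(current)
    result = []
    for run in runs:
        if len(run) >= min_levels:  # rule 4: at least three different ages
            # rule 5: a pair is one age level in the run but counts twice here
            density = sum(events[level] for level in run) / total
            result.append((run, density))
    return result


# Hypothetical layout: 10 events at eight age levels, including 2 pairs.
example = {6: 1, 8: 1, 15: 2, 21: 1, 24: 1, 26: 2, 27: 1, 28: 1}
print(continuous_sequences(example))  # [([24, 26, 27, 28], 0.5)]
```

With this layout the events at levels 6 and 8, the pair at 15, and the event at 21 are set aside, in line with the description in the text.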

If continuous sequences are found in any number, two important
questions can be considered. First, do these tend to spread out
across the entire age range? If they do, then development would
seem to involve changes within a relatively stable pattern and a
single answer sub-group key can be used for all age levels. On the
other hand, if continuous sequences appear to be relatively short
and confined to specific age level spans or to form a "stairway"
across the age levels then a single key for all age levels is not
sufficient to describe development in this case from these answers
on such tests. In this latter case (i.e. the appearance of a
"stairway") strong support for a developmental sequence among answer
patterns will be found. The presence of a "stairway" will mean that
development not only would be affecting which alternatives are
chosen, but the between choice relationships as well. Should these
effects vary with the age of the learner, a complex curved-line
pattern would be implied. The MULTIQUAL results (Powell, 1978a)
already presented carried the implication that rectilinear
models may be inappropriate for evaluating learner performance.
The appearance of a "stairway" would further support this
non-linearity, since in this case between alternative relationships
would be changing with age.

Here then are the four basic questions addressed in this study.

1) What can the pattern of answer selection across age, using only
frequencies as the raw data, tell us about the statistical and/or
psychological (behavioral outcome) properties of learner development?

2) What process/product inferences (if any) might reasonably be
drawn from these outcomes?

3) Is the total correct score a sufficient
basis for evaluating learner achievement?

4) Is the "Cumulative Learning" hypothesis valid as a behavioral
description of the nature of learning events?

RESULTS 

Considering the observations obtained, how do these compare
with the null hypothesis being explored by this study? Table 2
gives these results.



INSERT TABLE 2 ABOUT HERE 



The expectation that all interactions would be random, or that
patterning would favor the use of right answer aggregates over any
other procedure, was not supported by these findings. The relative
sizes of the cell Chi Squares speak for themselves.






TABLE 2

FIT OF THE FREQUENCIES OF MEANINGFUL EVENTS
TO THE RANDOM ASSUMPTION

                           INTERACTION
                  R*R           W*R           W*W

O>E       O        60            12           633
          E1       15            90           135
          χ1²     135.00         67.60       1837.07
          E2       30           180           270
          χ2²      30           156.8         488.03

O=E       O       540          3368          4731
          E1      570          3420          5130
          χ1²       1.58*         0.79*        31.03
          E2   [illegible]     3240          4860
          χ2²       7.5           5.06          3.42

O<E       O         0           220            39
          E1       15            90           135
          χ1²      15           187.78         68.27
          E2       30           180           270
          χ2²      30             8.89        197.63

TOTALS    O        60           232           672
          E       600          3600          5400
          χ1²     151.58        256.17       1936.30
          χ2²      67.50        170.75        689.08

GRAND     O       964
TOTALS    E      9600
          χ1²    2344.05
          χ2²     927.33

* CELL CHI SQUARE NOT MEANINGFUL
There were two tests conducted here. In the case where the first
range of expected values was considered, the random assumption fitted
to the normal curve was a ratio of 1:38:1. That is, O>E and O<E
should each be no more than 2.5 percent of the distribution. In the
second case, since the critical value for the cell Chi Squares used
was arbitrary, the average proportion of departure, 964 to 9600, was used
instead. This gave a ratio of 1:18:1. That is, O>E and O<E should
each represent about 5 percent of the total. In effect, this is a
test of symmetry and of order. The frequencies are clearly asymmetric
and order R*R < W*R < W*W in order of meaning.
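
The two sets of expected values follow directly from the two ratios. The sketch below illustrates the calculation for the W*R column of Table 2 (observed 12, 3368 and 220 out of a possible 3600). The function names are assumptions made for this illustration.

```python
def expected_counts(n_possible, ratio):
    """Split the possible events into O>E, O=E and O<E according to a ratio."""
    whole = sum(ratio)
    return [n_possible * part / whole for part in ratio]


def fit_chi_squares(observed, n_possible, ratio):
    """Chi Square contribution of each category against the ratio's expectation."""
    expected = expected_counts(n_possible, ratio)
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


wr_observed = [12, 3368, 220]  # O>E, O=E, O<E for W*R
first = fit_chi_squares(wr_observed, 3600, (1, 38, 1))   # 2.5 percent in each tail
second = fit_chi_squares(wr_observed, 3600, (1, 18, 1))  # 5 percent in each tail
print([round(x, 2) for x in first])   # [67.6, 0.79, 187.78]
print([round(x, 2) for x in second])  # [156.8, 5.06, 8.89]
```

Summing either list gives the column total under the corresponding assumption.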

These observations leave little question that there may be
meaning among interalternative interactions.

What are the distributive patterns of these meaningful events?
The order presented on Page 10 will be followed. Table 3 gives the
relationships among the frequencies of meaningful relationships
(cell χ² ≥ 2.4) among item pairs aggregated across age.

Only two of the R*R relationships might be considered large.
Q2 X Q1 is 16 out of a possible 60 and Q4 X Q2 is 18 out of 60. Two
others are close to 10 percent of the possible while the other 6 out
of 10 have 4 or fewer meaningful relationships between right answer
pairs. For most of these items, then, the "meaningful" relationships
between them on the right answers are probably statistical artifacts
even though all relationships occurred in only one direction.



TABLE 3

SUMMARY OF THE FREQUENCIES OF
MEANINGFUL RELATIONSHIPS BY ITEM ACROSS AGE

ITEM PAIR     W*W    W*R    R*R    PAIRS    SEQUENCES

Q2 X Q1        57     32     16      14     [illegible]
Q3 X Q1        67     23      4      13     [illegible]
Q4 X Q1        43     16      6       5     [illegible]
Q5 X Q1        70     29      5      11     [illegible]
Q3 X Q2        76     22      4      12          8
Q4 X Q2        59     23     18      12          9
Q5 X Q2        51     18      1       5          3
Q4 X Q3        89     22      1      14         12
Q5 X Q3        85     23      2       8         10
Q5 X Q4        75     24      3       9          7

[The values 11 and 13 also appear in this part of the source
without a recoverable position.]

TOTALS

1. WRONG BY WRONG = 672
   POSSIBLE (N1) = 5400
   PROPORTION = .1244

2. WRONG BY RIGHT = 232
   POSSIBLE (N2) = 3600
   PROPORTION = .0644

3. RIGHT BY RIGHT = 60
   POSSIBLE (N3) = 600
   PROPORTION = .1000

4. AGGREGATE = 964
   POSSIBLE = 9600
   PROPORTION = .1004

5. PAIRS AT SAME AGE = 103
   POSSIBLE (N4) = 472
   PROPORTION = .2182

6. SEQUENCES BETWEEN
   FIRST & SECOND TESTING = 93
   (LESS CAUSED BY PAIRS) = 80
   POSSIBLE = 472
   PROPORTION = .1695

DIFFERENCES OF PROPORTIONS:

1. W*W - W*R      z = 9.[illegible]; p .000
2. W*W - R*R      z = 1.877; p .05
3. W*R - R*R      z = 3.236; p .01
4. PAIRS - SEQ.   z = 2.411; p .05
For most of the item by item interactions, the frequency of
occurrence of wrong answer pairings is substantially larger than 3
to 1 in favor of W*W (i.e. the wrong-by-wrong answer interactions).
In fact only 3 of the 10 pairs show a ratio smaller than this. The
difference between the proportions for W*W and R*R is almost big enough
in favor of W*W to be statistically significant.
If the effects of Q2 X Q1 and Q4 X Q2 were removed, then this difference
would be highly significant.

For most of the 10 item interactions of the 5 items used, wrong
answer combinations would seem to tend to be more "meaningful" than
right answer combinations. Since only about 1/80th of the possible
number of interactions have been considered here, a continuation of
this current trend would be very likely to make this difference
significant.
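
A difference between two proportions of this kind can be examined with a z statistic. The source does not say which formula was used, so the sketch below is an assumption: it uses the common unpooled standard error and an invented function name. Applied to the W*W and R*R totals (672 of 5400 against 60 of 600) it gives a value of roughly 1.87, close to the 1.877 reported for this comparison.

```python
from math import sqrt


def z_difference(count_a, possible_a, count_b, possible_b):
    """z statistic for the difference between two independent proportions."""
    p_a = count_a / possible_a
    p_b = count_b / possible_b
    # Unpooled standard error (an assumption; the source does not name a formula).
    standard_error = sqrt(p_a * (1 - p_a) / possible_a + p_b * (1 - p_b) / possible_b)
    return (p_a - p_b) / standard_error


z = z_difference(672, 5400, 60, 600)  # W*W against R*R
print(round(z, 2))  # 1.87
```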

The second consideration was the pattern among these three
categories of events by age level aggregated across item interactions.
In this case the maximum possible R*R in each cell is 20, W*R is 120,
and W*W is 180. To make for easier comparison the three patterns are
shown on one graph in proportions of the total possible for each
category. Figure 5 gives this information.



INSERT FIGURE 5 ABOUT HERE 



Two observations are worthy of note. First, the R*R interaction
frequencies are heavily loaded at the low end of the age scale.
Second, from about age [illegible] years onwards, the W*W relationships are
almost always above R*R or W*R or both. The single exception occurred
at age 13[illegible]. This observation implies that the wrong answers contain
more meaning for older but not younger children. This observation is
the opposite of the findings reported by Bock. However, his items were
vocabulary items, which implies that a "know-guess" strategy was involved in
answering. If the "know-guess" phenomenon was also operative for the
younger children in this study, then there is no contradiction between
these results.

FIGURE 5

COMPARISONS AMONG PROPORTIONS OF OCCURRENCES OF
INTERACTIONS ACROSS AGE

Another item of interest arises from the comparison between the
aggregate frequency of relationships by age across item interactions
and the sample size for each age group. Figure 6 gives these
comparisons.



INSERT FIGURE 6 ABOUT HERE 



It is evident from Figure 6 that sample size does directly
influence the frequency of meaningful relationships throughout most of
the range. However, it is equally evident that this is not the only
influence. There seem to be at least three transition points where
the relationship frequency is lower than sample size influence would
predict. These transition points were used to divide the sequence
into four "process" levels which roughly correspond to Piaget's
stages (if Concrete Operations comes in two parts) and with the
stages found elsewhere (See: Powell, 1977). Such transition points
suggest that at least one aspect of development may involve cyclic
increase and decline of the degree to which interactions may be
meaningful. The points of lowered meaningful interaction seem to
coincide with the transitions between stages. This observation
seems to correspond to the increase in the variety and frequency
of errors found to occur at similar transition points (See: Powell,
1976). It also seems to be related to Piaget's inference that the
psychological schemata of a learner may need to be restructured at
each transition point. These observations would seem to reinforce the
possibility that development may be a non-linear phenomenon.

A COMPARISON BETWEEN ...

NOTE: Read the left hand scale as a ratio for the graph ...

The third curve shows the pattern when the frequency of
interactions is divided by the group size. The resulting proportions
approach .05 where the dashed lines are located. The peak at the left
is composed in large part of R*R interactions while the one on the
right is largely W*W interactions. Apparently, in contrast to
Bock's (1972) observation, in this study using a "higher mental
process" test, wrong answer interactions seem to increase in meaning
with age. This phenomenon seems particularly evident once the impact
of group size is removed from these data.

In the later age levels (See: Figure 8) the number of right
answers seems to fall off sharply, while the wrong-by-wrong interactions
continue to increase in importance.

The frequencies of all six possible relationships (either O>E or
O<E for R*R, W*R, and W*W) are given in Table 4.
This table is self explanatory. Well under 1 percent of the
total frequency of observations and 10 percent of the actual observations
occur in relationships whose direction would imply meaninglessness to
wrong-by-wrong answer interactions. Using the cell Chi Square > 2.4
criterion produced very little "noise". None of these occur with the
R*R interactions. The frequency of statistical artifacts would
appear to be satisfyingly low. Such low noise levels would support
the findings of high levels of variance accounted for reported elsewhere
(See: in particular Powell, 1976).

The frequencies of pairs vs. sequences were given in Table 3.
Pairs exceed sequences. Some 13 sets of pairs were consecutive. In
this case the sequence was caused by the pairing. If these are removed
then the proportion of sequences is significantly smaller than the
number of pairs. This observation would seem to have two implications.
First, age level seems to be more important than group membership in
the formation of a continuous sequence. Second, a span of 5 months
duration may be sufficiently large to make a significant discrimination
using this procedure. Using Total Correct scores alone, an age span
three times as big would be needed before group differences become
significant.

TABLE 4

FREQUENCIES OF ALL
POSSIBLE MEANINGFUL RELATIONSHIPS

                O>E    O<E    TOTALS

W*W
Q2 X Q1          51      6      57
Q3 X Q1          65      2      67
Q4 X Q1          41      2      43
Q5 X Q1          65      5      70
Q3 X Q2          74      2      76
Q4 X Q2          56      3      59
Q5 X Q2          48      3      51
Q4 X Q3          88      1      89
Q5 X Q3          85      -      85
Q5 X Q4          60     15      75
TOTALS          633     39     672

W*R
Q2 X Q1           -     32      32
Q3 X Q1           -     23      23
Q4 X Q1           1     15      16
Q5 X Q1           2     27      29
Q3 X Q2           2     20      22
Q4 X Q2           1     22      23
Q5 X Q2           3     15      18
Q4 X Q3           1     21      22
Q5 X Q3           -     23      23

Figure 7 gives the "continuous sequences" which were found
among these data. These continuous sequences are arranged
according to the sections of the four cognitive levels proposed above.



INSERT FIGURE 7 ABOUT HERE 

The "stairway" pattern discussed above is clearly evident among
many of these relationships. In the case of the W*W, many of the
continuities either spread across or discontinuously represent more
than one level. Only a very few (5 out of 32) represent three or more
levels and a larger proportion (about one third) only one level.

More than 86 percent of these sequences have a density of at
least .50. That is, they account for at least half of the number of
meaningful comparisons for that interaction. At a ratio of 32 to 3,
the W*W seems to be more important than the R*R and also at least as
complex. Unravelling the hierarchical structure of this complex
pattern will probably prove to be very challenging. Finally, the
marginal notations CON.RT and IRQ refer to the classification of these
comparisons in the earlier study (Powell, 1977). Of the 9 which
should have been present, 6 appeared. Only the classification OS, which
should have had 3 meaningful relationships, did not appear. In other
words, two thirds of the homogeneous subgroupings of items have
replicated from one sample to the second.

In addition, several wrong alternatives show meaningful sequences
with two or more other wrong alternatives. It may be possible, if this
observation proves stable, to use such bifurcations to identify
meaningful subgroupings of a sample or a population.

Another comparison which is illuminating is between the MULTIQUAL
study and this present one. Figure 8 gives the distribution of the
answers by age level on Item 1 of the Proverbs Test, with the raw data
distribution on the same item (See: Figure 3a) superimposed. This
comparison is also presented in Figure 8.



INSERT FIGURE 8 HERE 



The fit is remarkable but it is not identical since it is
necessary to shift the component curves to the right, to change
their relative vertical placement, and to expand the time interval.
Their shape is not changed to get this fit. In mathematical terms,
a phase shift, a change of amplitude, and a change of period are
required to achieve this effect. Since the new sample is 50 percent
city center and the original one was entirely suburban, and this
sample was collected in October/March of 77/78 and the earlier in
May 75, a fit as good as this would be unlikely by chance. These data
would seem to present a reasonably good replication if the position
shifts are a legitimate procedure for such comparisons.

Are these positional shifts a Grade 5 event, a city center
event, or a time of year event? If they are related to population
mix then the patterns of answer interactions may be useful for
identifying the response selection behaviors of specific subgroups
in a population. In this case it may be possible to distinguish
between cultural difference and mental retardation by using
interalternative interactions.

The MULTIQUAL model version does not replicate nearly as well
as the original proportions. This implies that although curved
lines are necessary, quadratic and cubic curves may not be appropriate,
a fact already evident when the extended range is presented as in
Figure 8, and from the study of these same data conducted by Yu (1977).

Another feature of this pattern is that the right answers
decline and the wrong answers increase with each of the three
"transitions" which were identified from criteria independent of
selection frequency. This observation supports the possibility that
non-linear transformations may be a fundamental characteristic of
development. This latter conclusion is further supported by the
replication validation of these configurations mentioned earlier.
In effect, irregularities in answer distributions may be meaningful
and perhaps should not be removed, arbitrarily, by replacing curves
such as those found in Figure 8 by the "best fitting" straight lines.
Straight line approaches to these data, such as Total Correct Scores,
may not be appropriate.

Also of interest is the appearance among the oldest learners of
yet another drop in right answer frequency combined with a rise in
wrong answer frequency. Does this observation imply that "Formal
Operations", the high point for Piaget's theory, is not the last stage
in development, as he assumes it to be?

DISCUSSION AND CONCLUSION 

Of all the predictions made about the developmental patterns of
answers, particularly among wrong answers, all comparisons supported a
non-linear developmental pattern. Wrong answers remained generally
superior to right ones for meaningful information. It is important to
point out once again that this information is related to the
interactions between alternatives and is independent of selection frequency
of particular alternatives. Selection frequencies may well be
important as well. The point here seems to be that a large portion
of the dynamics of the developmental processes may be lost when
selection frequencies within items, and particularly selection
frequencies of right answers alone, are the sole criterion for
evaluating learner performance.

Using the interactions among alternatives instead of cumulative
frequencies, it appears that learning may not be cumulative as the
counting of right answers or addition of subscores implicitly assumes.
Instead, development would seem to be a complex sequence of subtle
transformations among thought processes which seem, from these data,
to be strongly characterized by sequences of relationships and
systematic relationship changes. Changes among the "wrong" answers
appear much more frequently than among the right ones. This current
study is now the third, using different age groups and tests, in which
wrong answers have appeared to be more "meaningful" than right
answers. In the other two studies (Powell 1970, 1976) wrong answers
were first to appear, and in aggregate accounted for the largest
proportion of the total variance, when multiple regression procedures
were used to predict independent achievement measures.

The most interesting of all these observations was the fact that
W*W interactions were the most meaningful for the higher age levels.
The implication drawn from this observation is that Knowledge level
items may function adequately under the "know-guess" hypothesis and
classical test theory may be appropriate for such items. However,
for the so called "higher mental process" items, the "know-guess"
hypothesis may become increasingly invalid. This event may arise
because, in a question requiring thought, there may be more than one
reasonable answer. This possibility has already been explored with the
8 year olds discussed above.

The four questions posed earlier can now be answered. This
analysis of the interactions among five out of 40 questions represents
1/80th of the possible number of interactions on this test. Even this
small portion of these data has already shown one statistically
significant discrimination using a 5 month age interval over an age
range in excess of 12 years. Such fine tuning has generally proven
to be impossible for a group test of so few items using Total
Correct scores to assess learner status.

First, the patterns of answer selection across age suggest that
learning may involve a complex hierarchically ordered sequence of
interactive (non-linear) events.

Second, it would appear that, where "higher mental processes"
are concerned, process may possibly take precedence over product.
A stronger picture of the dynamics of development seems to emerge when
the within event frequencies are suppressed in favor of between
event interactions than when frequencies are considered by themselves.
Pattern analysis would seem to be much more meaningful than frequency
counts.

Third, Total Correct Scores are probably not adequate for the
evaluation of learner progress at levels above simple recall.

Fourth, for the "higher mental processes" the "Cumulative Learning"
hypothesis would seem to be refuted. In any case, since a single
contrary event is all that is necessary to refute an "all" hypothesis,
it can now be conclusively said that not all learning is cumulative.

It is now possible to address the question raised in the title of
this paper. Developmental information is indeed available from wrong
answers. Apparently, Total Correct scores by themselves are insufficient
to describe the developmental status of a learner. Not only is it
necessary to know how many answers of specific types are chosen,
but which answers and perhaps even why these were chosen. There is
evidence not reported here (See Powell, 1978b) to suggest that this
latter information may be available indirectly from a parallel
administration of a personality measure.

It appears, particularly for higher process tests, that counting
the right answers oversimplifies the situation, either ignoring or
obliterating critical information. An approach to test interpretation
which considers the pattern of particular answers (both right and
wrong) selected would seem to be necessary to determine the developmental
status of a learner. It also appears that a single administration of
a content oriented achievement test may not be sufficient to this
task. How a problem was solved may be directly available from the
answer chosen but why it was chosen may require more information.
This information is obtainable from interview (Powell, 1977) or from
self-report (Powell, 1968). It may also be available indirectly
from the parallel administration of personality measures (Powell,
1978b).

In any event, analysis of test results on an answer-by-answer
basis seems to be more meaningful than analysis on an item-by-item
basis. Item-by-item analysis would seem to be more meaningful than
any form of aggregation and particularly more meaningful than Total
Correct scores. The fine tuning which may be possible using pattern
analysis could hold considerable psychometric promise.
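
An answer-by-answer analysis of the kind described above can be sketched
as a table of cell chi-squares for the alternatives chosen on two items.
This is only an illustration: the responses are hypothetical, and the
expected counts come from the ordinary independence model, which may
differ from the expected values used in this study.

```python
from collections import Counter

def cell_chi_squares(answers_x, answers_y):
    """Cell chi-squares for the cross-tabulation of two items' choices.

    Expected counts assume the two choices are independent:
    E = row_total * column_total / n; each cell contributes (O - E)**2 / E.
    """
    n = len(answers_x)
    observed = Counter(zip(answers_x, answers_y))
    row_totals = Counter(answers_x)
    col_totals = Counter(answers_y)
    cells = {}
    for a in sorted(row_totals):
        for b in sorted(col_totals):
            expected = row_totals[a] * col_totals[b] / n
            cells[(a, b)] = (observed[(a, b)] - expected) ** 2 / expected
    return cells

# Hypothetical choices of eight subjects on two items (not study data).
item_1 = ["A", "A", "A", "A", "B", "B", "B", "B"]
item_2 = ["C", "C", "C", "D", "D", "D", "D", "C"]

cells = cell_chi_squares(item_1, item_2)
for cell, value in cells.items():
    print(cell, round(value, 3))
print("total chi-square:", sum(cells.values()))
```

Each cell is examined separately, so an interaction between two wrong
alternatives is given the same attention as one between two right
answers.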

Although the current sample is large and the findings strongly
indicative in these particular directions, there are still many
unanswered questions. It may be true that wrong answers are more
"meaningful" than right answers without this conclusion being
particularly useful to educators. Of what benefit might wrong answers be to
educators if they were to try to use them?

Are the patterns found here characteristic of all children or
only of this locality? Would the same pattern replicate in New York
or Terre Haute or London, England?

If the irregularities in these curves are meaningful as well as
the general trends, what do these irregularities mean? Further
exploration of the current data set, which contains a personality test
and is a repeated measures design, could possibly throw much light on
these issues. In fact, it is already doing so, but this is the subject
of another paper.



The most intriguing aspect of the results of this current study
to the educator and to the measurement expert is the implication that
only Knowledge level questions may be stable under the "know-guess"
hypothesis. Does this observation imply that Total Correct Scores
may be valid only for recall and direct recognition items?

Does this implication also mean that all studies which used
total correct scores for basic data (analysis of variance studies, for
instance), where more than recall was the concern, will need to be
reworked?

Have educators, in pursuit of "right" answers, forced inanity
and triviality upon the learning process in order to get stable test
results? Have these educators been aided and abetted by measurement
theorists who have built their theories upon the random normal variate
model and upon the Total Correct Scores model from classical test theory?
Has this problem been further compounded by the testing experts who
have built their standardized tests and normed or outcome-referenced
these tests based upon Total Correct Scores and/or total subtest scores?

Specifically, has the nature of the observations being made in
educational measurement (namely, the aggregation of frequencies of
one arbitrary class of events) prevented us from observing the most
important learner-environment transactions in the learning process?
Has this restriction in the observational framework employed effectively
restricted educational outcomes to the recall of inanities and trivialities
in order to achieve stable test results? Are the claims of critics
correct, but for the wrong reason? Does this possibility explain many of
the puzzling outcomes in educational research?

For instance, do these observations explain the low level of
success on intelligence measures often experienced by individuals
displaying subcultural differences? Do these findings help to
explain why adults tend to do less well on intelligence measures than
do adolescents? Do they help to explain why the profound thinkers
tend to be less successful at formal schooling than more convergent
thinkers? Do these outcomes help to explain why the expected differential
effects from different educational interventions have failed to
appear with any consistency?

Do the subgroupings reflect cultural variables, learning style
variables and the like in such a manner that it might be possible to
get a better match between learner characteristics and teaching strategy
than is now possible?

Would educational procedures focussing upon how people solve
problems be more motivating than telling learners the solutions others
have found? Could educators improve their ability to track learner
progress using answer pattern analysis over Total Correct Scores? If
so, would this improvement be enough to warrant the extra difficulties
involved in obtaining this additional information?

In any case, the patterns among answers would seem to be far more
complex than a Total Correct Score either implies or provides information
about. A strong but complexly interactive developmental pattern seems
evident. Test theory and evaluation procedures may be back to square
one, but this time the outcomes from the available information would
seem to be profound rather than superficial. The possibility of
determining how learners attack problems, of identifying learner
subgroups and of getting considerable status information from relatively
short tests may emerge from these findings.

In part, the current observations seem to explain why current
measurement practice seems to produce superficial results, by implying
that Total Correct scores may, at best, be superficial in evaluating
learner progress, and at worst may be invalid for that process because
an inappropriate mathematical model may be employed.

Much more research is needed into this new area before the details
of the ramifications of these findings are clarified. However, this
present study may well be a good beginning.

In this case, where do we go from here? It is already apparent
that there are too many meaningful interactions among wrong alternatives
within one item for a unique classification of the whole group. The
multiple representation of alternatives within items would imply that
alternative interactions may identify subgroups as well as positions
in the developmental sequences. Further exploration of this problem
is needed before some form of scoring procedure or pattern analysis
procedure can be developed which will extract the useful information
from between-alternative interactions in a manageable form.

Using Item No. 1 only, it appears that relationships between wrong
alternatives and personality variables are much more common than
between right alternatives and these same variables. If this
situation persists throughout, perhaps personality factors can be
identified which may help to distinguish between the subgroupings of
subjects mentioned above.

Once a clear basis for some form of scoring procedure or answer
pattern description which distinguishes among subjects has been found,
attempts can be made to predict the patterns in the second administration
from the patterns in the first.

If "good enough" predictions can be established (say, a correlation r
meeting some set criterion), then
patterns in the second administration may be matched with the next
higher age level in the first administration. Hopefully, this "leap
frog" approach may reveal fairly clearly definable developmental
pathways. These pathways might persist through several age levels, perhaps
even to points of school exit (about which data are available).

The branch points and other critical characteristics of these
pathways may be determinable. Once the developmental pathways
(if such can be found) are mapped, attempts can be made to use this
same instrument package with new subjects about whom additional
information can be obtained. The impacts of various intervention procedures
upon pathway progress could, at this point, be studied in depth.
Penetration into the critical and/or central aspects of teaching/learning
interactions may be possible. Hopefully, the findings from this data
base using the procedures employed in the present study can actually
be pushed as far as this speculation proposes.

Judging from the fact that most age levels in this study are
represented by at least one commencement or termination of a sequence,
perhaps a response pattern analysis on this 40-item test could identify
meaningful changes over relatively short time intervals. One 80th of
the item interactions on this test have produced more meaningful
sequences (32) than age levels (30) used in this study. There are
nearly three fourths of a million potentially meaningful
alternative-by-alternative interactions on this one test if both
administrations of 5-month age aggregates are used. If current ratios
continue throughout, much accuracy of placement and tracking may be well
within reasonable possibilities.
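
The figure of nearly three fourths of a million can be reproduced with
simple counting. The sketch below rests on assumptions that are not
stated in this passage: a 40-item test, four alternatives per item, and
only interactions between alternatives of different items being counted.
The 30 age levels and the two administrations are taken from the text.

```python
from math import comb

N_ITEMS = 40           # assumed test length
N_ALTERNATIVES = 4     # assumed number of alternatives per item
N_AGE_LEVELS = 30      # age levels used in this study
N_ADMINISTRATIONS = 2  # first and second administration

# Pairs of distinct items, then every alternative of one item
# crossed with every alternative of the other.
item_pairs = comb(N_ITEMS, 2)
alternative_pairs = item_pairs * N_ALTERNATIVES ** 2

# One potential interaction per age level per administration.
total = alternative_pairs * N_AGE_LEVELS * N_ADMINISTRATIONS

print(item_pairs, alternative_pairs, total)
```

Under these assumptions the total is 748,800, which is consistent with
the "nearly three fourths of a million" quoted above.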

The ultimate goal is to try to identify the impacts of intervention
procedures upon developmental process outcomes. Since wrong answers
seem to be a powerful source of process information, and inter-event
relationships seem to add information to unrelated frequency aggregates,
this approach would seem to hold promise. Could "effective teaching"
get a better definition than the current one in this way? Could learning be
improved, and by how much? These are all questions which may be
answerable, at least in part, from the alternative procedures described
in detail in this paper. All or most of these problems might be
attacked from this present data set.

This research plan is the direction in which the present author intends
to proceed. Anyone who is interested in pursuing any part of this
complex problem using the data set employed herein is welcome to do
so for the price of a computer tape, some postage, perhaps some phone
calls and a willingness to share findings. Please join the team.